Conversation
Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
Force-pushed from 6ab54d1 to 199a712
```yaml
vm:
  name: VM Checks
  runs-on: build-amd64
```
maybe for now we can leave these out. we'll add tests once we get things over to the driver architecture.
same as above. i think we stash this. once we land the initial implementation let's add the vm to our e2e tests.
nit: we don't want to check this in. i captured the plan for posterity under OS-53.
```diff
-# ./build-rootfs.sh [--base] [--arch aarch64|x86_64] [output_dir]
+# ./build-rootfs.sh [--base] [--gpu] [--arch aarch64|x86_64] [output_dir]
```
Will we need to publish two rootfs images?
Yes — the plan is two rootfs variants: a standard CPU rootfs (the current default) and a GPU rootfs built with --gpu. The GPU variant bakes in the NVIDIA driver packages (~550MB+) and nvidia-container-toolkit, which are required for the guest init to succeed with GPU passthrough. Keeping them separate avoids bloating the default image for non-GPU users. The launcher auto-selects the GPU rootfs when --gpu is passed (fixing that selection logic is part of the Codex P1 feedback in this PR). Long-term we could explore a layered approach, but two distinct builds is the simplest path for now.
posted via cursor for vince-brisebois
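The auto-selection described above could be sketched in a few lines of shell. This is an illustrative toy, not the launcher's actual code: `select_rootfs`, the `ROOTFS_DIR` default, and the image file names are all assumptions.

```shell
#!/bin/sh
# Toy sketch of rootfs auto-selection; select_rootfs, ROOTFS_DIR, and the
# image names are hypothetical, not openshell-vm internals.
select_rootfs() {
  dir="${ROOTFS_DIR:-/var/lib/openshell/rootfs}"
  if [ "$1" = "--gpu" ]; then
    # GPU variant bakes in the NVIDIA driver packages and
    # nvidia-container-toolkit, so pick the larger image only on demand.
    echo "$dir/rootfs-gpu.img"
  else
    echo "$dir/rootfs.img"
  fi
}

select_rootfs          # prints /var/lib/openshell/rootfs/rootfs.img
select_rootfs --gpu    # prints /var/lib/openshell/rootfs/rootfs-gpu.img
```

The point of the sketch is that the flag, not a separate config knob, decides which image boots, which is exactly the selection logic the Codex feedback flags as broken in the current patch.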
plan here is to ship a single kernel for both GPU and CPU? (i think this makes sense for now, just checking)
Yep, single kernel for both. The GPU-related kconfig additions (CONFIG_PCI, CONFIG_PCI_MSI, CONFIG_DRM, CONFIG_MODULES, CONFIG_MODULE_UNLOAD) are harmless on non-GPU boots — the kernel just probes and finds no PCI devices. The size overhead is negligible since these are small built-in drivers, not loadable modules. Avoids the complexity of maintaining and distributing two separate kernel builds.
posted via cursor for vince-brisebois
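For reference, a kconfig fragment covering the options named above would look like this (these are standard kernel config symbols; whether each ends up built-in or modular in the actual build is not specified here):

```
# GPU passthrough prerequisites; effectively no-ops on boots with no PCI GPU.
CONFIG_PCI=y
CONFIG_PCI_MSI=y
CONFIG_DRM=y
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
```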
Not GPU-related — this was a test isolation fix for `OPENSHELL_COMMUNITY_REGISTRY` that got bundled into the feature commit by accident. The `bare_name_expands_to_community_registry` test could flake if the env var is set externally.
Reverted from this branch. Will submit separately as a `test(core): fix env isolation in image resolution tests` PR so it doesn't muddy this diff.
posted via cursor for vince-brisebois
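The flake mechanism is easy to demonstrate: an exported env var leaks into every child process unless the test run explicitly unsets it. A minimal shell illustration (the registry URL is a placeholder; this is not the project's test harness):

```shell
# Simulate an externally exported registry override leaking into a test run,
# then isolate the run by unsetting the variable first.
export OPENSHELL_COMMUNITY_REGISTRY="https://example.invalid"

# Without isolation the child process sees the external value:
sh -c 'echo "leaked: ${OPENSHELL_COMMUNITY_REGISTRY:-unset}"'

# With isolation (env -u) the child falls back to its default:
env -u OPENSHELL_COMMUNITY_REGISTRY \
  sh -c 'echo "isolated: ${OPENSHELL_COMMUNITY_REGISTRY:-unset}"'
```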
we want to avoid adding the whole vm dep here if possible. let's talk about this more since i don't know if it's straightforward to remove or not.
Also ran this through Codex review; its feedback looks pretty good: "The patch introduces two functional blockers in the new GPU path: the gateway CLI drops the VFIO bind guard as soon as deployment returns, and `openshell-vm --gpu` still boots the non-GPU rootfs by default. It also leaves host IPv4 …" Full review comments:
…a unbind hardening
Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
Force-pushed from 199a712 to a9491fd
Summary
Adds VFIO GPU passthrough support to openshell-vm using cloud-hypervisor as a second VMM backend alongside libkrun. Includes a full GPU bind/unbind lifecycle with safety checks, nvidia driver deadlock hardening (subprocess isolation with timeout, pre-unbind module cleanup, post-timeout verification), and an RAII guard that restores the original driver on exit.
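The bind/restore lifecycle with an on-exit guard can be modeled in a few lines of shell. This is a deliberately toy version: a temp file stands in for the real `/sys/bus/pci` sysfs interface, and a `trap` plays the role of the RAII guard's Drop; none of this is the PR's actual implementation.

```shell
#!/bin/sh
# Toy model of the driver-restore guard; the state file stands in for
# /sys/bus/pci/devices/<BDF>/driver, which the real code manipulates.
state="$(mktemp -d)/driver"
echo nvidia > "$state"            # device starts bound to the nvidia driver

orig="$(cat "$state")"            # remember the original driver
restore() { echo "$orig" > "$state"; }
trap restore EXIT                 # shell analogue of the RAII guard

echo vfio-pci > "$state"          # "bind" to vfio-pci for passthrough
echo "bound: $(cat "$state")"

restore; trap - EXIT              # guard fires (normally on process exit)
echo "restored: $(cat "$state")"
```

The design point the toy captures: whoever owns the guard decides when the original driver comes back, so dropping the guard early (as the Codex feedback notes the gateway CLI does) reverts the bind while the VM may still need the device.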
Related Issue
N/A
Changes
Testing
`mise run pre-commit` passes
Checklist